uldagalihan/distributed-cve-search-engine

Distributed Cybersecurity Vulnerability Search Engine

A Hadoop-based vulnerability search engine that builds a distributed TF-IDF index over NVD CVE records and provides an interactive Java Swing interface for ranked CVE search.

The system uses Hadoop HDFS for distributed storage, MapReduce for offline indexing, and an in-memory Java search layer for interactive query serving. It was designed around NVD CVE data normalized into JSON Lines, where each line is one independent CVE document.

System architecture

Highlights

  • Distributed storage of CVE JSONL datasets in HDFS
  • MapReduce pipeline for tokenization, inverted index construction, and TF-IDF scoring
  • Java Swing GUI with HDFS file management, MapReduce job monitoring, and CVE search
  • AND/OR query mode, Top-N search, Show All Results, sortable result columns, and detailed raw JSON view
  • Hadoop configuration and preprocessing scripts included for reproducibility

User Interface

Search Interface

CVE Detail View

Search Interface Detailed

Repository Structure

src/                 Java source code for MapReduce jobs, CLI search, and Swing GUI
preprocessing/       Python scripts for converting NVD CVE JSON files to JSONL datasets
hadoop-config/       Hadoop XML configuration files used in the 2-node VM cluster
dist/                Built project JAR
screenshots/         GUI, architecture, HDFS UI, and YARN UI screenshots
datasets/            Dataset documentation only; large JSONL files are gitignored
raw-nvd-json/        Raw NVD JSON documentation only; raw JSON files are gitignored

The technical report is included as BLM4821_CVE_Engine_Report.pdf.

Data Source and Dataset Variants

The project uses NVD CVE data. Raw CVE JSON files were obtained from the community-maintained fkie-cad/nvd-json-data-feeds repository, which reconstructs the NVD JSON data feed packages from NVD data:

https://github.com/fkie-cad/nvd-json-data-feeds

The raw JSON files were normalized into JSON Lines format. One JSONL line corresponds to one CVE record and one searchable document. The implementation was tested on CVE data from 2018-2025, but the preprocessing and indexing pipeline can be applied to other NVD year ranges as well.

Dataset    Size       Document count   HDFS raw path
Compact    107.5 MB   222,083          /raw/cve_2018_2025_compact.jsonl
Enriched   283.1 MB   222,083          /raw/cve_2018_2025_enriched.jsonl
Large      523.6 MB   222,083          /raw/cve_2018_2025_500mb.jsonl

Large raw files are intentionally excluded from GitHub. The data files in the local datasets/ and raw-nvd-json/ directories are gitignored because they range from hundreds of megabytes to more than one gigabyte. Use the scripts under preprocessing/ to recreate the JSONL datasets from raw NVD JSON files.
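As a rough illustration of the JSON Lines normalization, the sketch below writes one JSON object per line. It is not the code under preprocessing/. The function name records_to_jsonl and the simplified record fields (id, description) are assumptions made for this example, and real NVD records carry many more fields.

```python
import io
import json


def records_to_jsonl(records, out):
    """Write one compact JSON object per line; return the number of lines written."""
    count = 0
    for record in records:
        # json.dumps escapes embedded newlines and non-ASCII characters,
        # so each record stays on exactly one physical line.
        out.write(json.dumps(record, separators=(",", ":")))
        out.write("\n")
        count += 1
    return count


# Hypothetical, heavily simplified records used only for illustration.
sample = [
    {"id": "CVE-0000-0001", "description": "Example buffer overflow\nin a parser."},
    {"id": "CVE-0000-0002", "description": "Example SQL injection."},
]

buffer = io.StringIO()
written = records_to_jsonl(sample, buffer)
lines = buffer.getvalue().splitlines()
```

Because every line is an independent document, Hadoop can split the file at line boundaries and hand each mapper whole CVE records.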

Hadoop Cluster Environment

  • Host: Windows laptop running VirtualBox
  • Guest OS: Ubuntu Server 22.04
  • Hadoop: 3.4.1
  • Java: 11
  • Cluster layout:
    • hadoopmaster / 192.168.56.101: NameNode, ResourceManager, DataNode, NodeManager
    • hadoop-worker1 / 192.168.56.102: DataNode, NodeManager

Monitoring interfaces used during development:

  • HDFS NameNode UI: http://192.168.56.101:9870
  • YARN ResourceManager UI: http://192.168.56.101:8088/cluster

Hadoop Configuration

The hadoop-config/ directory contains the cluster configuration used for this project:

  • core-site.xml: default filesystem, e.g. hdfs://hadoopmaster:9000
  • hdfs-site.xml: HDFS directories, replication, NameNode/DataNode settings
  • mapred-site.xml: MapReduce configured to run on YARN
  • yarn-site.xml: ResourceManager and NodeManager addresses
  • hadoop-env.sh: Hadoop environment variables such as JAVA_HOME
  • workers: worker node list

Configuration files were prepared on the master node and copied to the worker node using scp.

MapReduce Pipeline

Job 1: CVE Tokenizer

Reads raw JSONL CVE records, extracts relevant fields, normalizes text, and emits one tokenized document per CVE.
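The per-record work of this job can be pictured with a small Python sketch. The actual job is a Java MapReduce class, and its exact field selection and normalization rules are not described here, so the lowercase alphanumeric token pattern, the description field, and the helper names below are all assumptions.

```python
import json
import re

# Assumed normalization: lowercase, then keep runs of letters and digits.
TOKEN_PATTERN = re.compile(r"[a-z0-9]+")


def tokenize(text):
    """Return the normalized tokens of a text, in order of appearance."""
    return TOKEN_PATTERN.findall(text.lower())


def tokenize_record(jsonl_line, fields=("description",)):
    """Parse one JSONL line and return (cve_id, tokens) for the chosen fields."""
    record = json.loads(jsonl_line)
    tokens = []
    for field in fields:
        tokens.extend(tokenize(str(record.get(field, ""))))
    return record["id"], tokens


# Hypothetical record used only for illustration.
line = '{"id": "CVE-0000-0001", "description": "Apache HTTP Server: heap-based buffer overflow"}'
doc_id, tokens = tokenize_record(line)
```

In MapReduce terms, tokenize_record corresponds to what a mapper would do for each input line, emitting the CVE identifier as the key and the token list as the value.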

hadoop jar dist/cve-search.jar com.cvesearch.CveTokenizerJob \
  /raw/cve_2018_2025_compact.jsonl /tokens/compact

Job 2: Inverted Index

Builds posting lists from tokenized documents.

hadoop jar dist/cve-search.jar com.cvesearch.InvertedIndexJob \
  /tokens/compact /index/compact

Output format:

term -> CVE:tf,CVE:tf,...
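The posting-list format above can be illustrated with a single-process Python sketch. It mimics the result of the map and reduce phases in memory rather than reproducing the Java job; the function names and sample documents are hypothetical.

```python
from collections import Counter, defaultdict


def build_postings(tokenized_docs):
    """Map each term to {cve_id: term frequency} from (cve_id, tokens) pairs."""
    postings = defaultdict(dict)
    for doc_id, tokens in tokenized_docs:
        # Counter gives the term frequency (tf) of every term in this document.
        for term, tf in Counter(tokens).items():
            postings[term][doc_id] = tf
    return postings


def format_posting_line(term, doc_tfs):
    """Render one posting list as 'term -> CVE:tf,CVE:tf,...'."""
    pairs = ",".join(f"{doc_id}:{tf}" for doc_id, tf in sorted(doc_tfs.items()))
    return f"{term} -> {pairs}"


# Hypothetical tokenized documents used only for illustration.
docs = [
    ("CVE-0000-0001", ["apache", "server", "apache"]),
    ("CVE-0000-0002", ["apache", "overflow"]),
]
postings = build_postings(docs)
apache_line = format_posting_line("apache", postings["apache"])
```

The number of entries in a posting list is the document frequency of the term, which the TF-IDF step needs.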

Job 3: TF-IDF Index

Computes TF-IDF scores for each term-document pair.

hadoop jar dist/cve-search.jar com.cvesearch.TfIdfJob \
  /index/compact /tfidf/compact 222083

Formula:

TF-IDF = TF * log(N / DF)

where N is the document count and DF is the number of documents containing the term.
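A short worked example of the formula, using the document count passed to the job above. The logarithm base is not stated, so the natural logarithm is assumed here, and the term and document frequencies are made-up values chosen only to show how the score behaves.

```python
import math


def tf_idf(tf, df, n_docs):
    """TF-IDF = TF * log(N / DF), with the natural logarithm as an assumed base."""
    return tf * math.log(n_docs / df)


N = 222083  # document count used in the TfIdfJob command above

# Hypothetical frequencies: same TF, different DF.
score_rare = tf_idf(tf=2, df=100, n_docs=N)
score_common = tf_idf(tf=2, df=100000, n_docs=N)

# A term that occurs in every document gets log(1) = 0, whatever its TF.
score_everywhere = tf_idf(tf=5, df=N, n_docs=N)
```

With equal term frequency, the rarer term scores higher, so distinctive terms dominate the ranking while terms present in nearly every CVE contribute almost nothing.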

Running the GUI

The GUI loads the TF-IDF index and raw CVE JSONL records from HDFS into memory. Use a larger heap for enriched or large datasets.

java -Xmx2g -cp "dist/cve-search.jar:$(hadoop classpath)" com.cvesearch.CveSearchGUI

For the large dataset:

java -Xmx2500m -cp "dist/cve-search.jar:$(hadoop classpath)" com.cvesearch.CveSearchGUI

GUI modules:

  • HDFS File Manager: browse HDFS, upload VM file to HDFS, download HDFS file to VM, delete, refresh
  • MapReduce Job Monitor: run Tokenizer, Inverted Index, TF-IDF, or Full Pipeline
  • Search Interface: load index, search CVEs, sort results, inspect detailed CVE records

Search Model

Search does not run MapReduce. MapReduce is used only for offline index construction. At query time, the GUI reads the precomputed TF-IDF index from HDFS into memory, combines posting lists using AND/OR logic, ranks CVEs by accumulated TF-IDF score, and displays Top-N or all matches.
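The query-time logic described above can be sketched in a few lines of Python. This is an illustration under assumptions, not the GUI's Java code: the in-memory index layout (term mapped to per-CVE scores), the search function, the tie-breaking by CVE identifier, and the sample scores are all hypothetical.

```python
def search(index, query_terms, mode="AND", top_n=None):
    """Rank CVEs by accumulated TF-IDF score over the query terms."""
    term_postings = [index.get(term, {}) for term in query_terms]
    if not term_postings:
        return []
    doc_sets = [set(postings) for postings in term_postings]
    if mode == "AND":
        # AND keeps only documents that contain every query term.
        candidates = set.intersection(*doc_sets)
    else:
        # OR keeps documents that contain at least one query term.
        candidates = set.union(*doc_sets)
    scores = {
        doc: sum(postings.get(doc, 0.0) for postings in term_postings)
        for doc in candidates
    }
    # Highest score first; ties are broken by CVE identifier for stable output.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked if top_n is None else ranked[:top_n]


# Hypothetical precomputed TF-IDF index used only for illustration.
index = {
    "apache": {"CVE-0000-0001": 1.5, "CVE-0000-0002": 0.5},
    "overflow": {"CVE-0000-0002": 2.0, "CVE-0000-0003": 1.0},
}
and_hits = search(index, ["apache", "overflow"], mode="AND")
or_hits = search(index, ["apache", "overflow"], mode="OR", top_n=2)
```

Passing top_n=None corresponds to showing all results, while a number truncates the ranked list after sorting, so the match count and the shown count can differ as in the table below.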

Example search results:

Dataset   Query    Mode   Display    Matches   Shown   Query time (s)
Compact   apache   AND    Show All   1910      1910    0.591
Large     apache   AND    Top-N 50   1934      50      0.571

Measured MapReduce Execution Times

Dataset    Size       Tokenizer   Inverted Index   TF-IDF      Total
Compact    107.5 MB   5m58.959s   11m29.182s       3m29.980s   20m58.121s
Enriched   283.1 MB   6m07.060s   9m18.300s        4m26.863s   19m52.223s
Large      523.6 MB   5m57.599s   9m08.788s        2m45.042s   17m51.429s

The runtime does not increase monotonically with raw dataset size because this small virtualized Hadoop cluster is affected by HDFS block placement, input splits, disk cache, JVM warm-up, current VM load, YARN scheduling overhead, and output characteristics.

Notes on Large Files

GitHub has practical file and repository size limits. Raw NVD JSON files and generated JSONL datasets are excluded from version control. The repository keeps the source code, preprocessing scripts, Hadoop configuration, screenshots, built JAR, and report while documenting how to recreate the data locally.
